Add missing indexes on dag_version_id columns for db cleanup performance#60307
GaneshPatil7517 wants to merge 7 commits into apache:main
Adds indexes on task_instance.dag_version_id and dag_run.created_dag_version_id to speed up the airflow db clean command when cleaning dag_version records. Without these indexes, the cleanup operation performs full table scans on both tables for each batch, causing ~6 minutes per batch on tables with 300K+ rows. With the indexes, the same operation completes in under 2 minutes total. Fixes apache#60145
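The scan-versus-index difference described above can be illustrated with a small sketch. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, with heavily simplified table definitions that are assumptions, not Airflow's real schema. Only the index name and column come from this PR. The exact plan text varies by SQLite version, so the sketch checks only the leading keyword.

```python
import sqlite3

# In-memory SQLite database as a stand-in; the PR itself concerns PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dag_version (id INTEGER PRIMARY KEY);
    -- Simplified stand-in for task_instance (assumption: real table has many more columns).
    CREATE TABLE task_instance (
        id INTEGER PRIMARY KEY,
        dag_version_id INTEGER REFERENCES dag_version (id)
    );
    """
)

# The kind of lookup a database needs when checking whether a dag_version row is still referenced.
LOOKUP = "SELECT 1 FROM task_instance WHERE dag_version_id = ?"


def plan(connection, sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, (1,)).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_before = plan(conn, LOOKUP)  # full scan of task_instance
conn.execute(
    "CREATE INDEX idx_task_instance_dag_version_id ON task_instance (dag_version_id)"
)
plan_after = plan(conn, LOOKUP)  # index search on dag_version_id

print("before:", plan_before)
print("after: ", plan_after)
```

On PostgreSQL the equivalent check would be done with EXPLAIN on the cleanup queries; the numbers quoted in this PR come from the author's report, not from this sketch.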
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
- Renamed migration from 0098 to 0099 to chain after new upstream migration
- Updated down_revision to e79fc784f145 (timetable migration)
- Updated migrations-ref.rst with correct chain
- Our migration 62fb1d0a1252 is now the new HEAD for 3.2.0
Pull request overview
This pull request addresses a performance issue with the airflow db clean -t dag_version command by adding database indexes on foreign key columns that reference dag_version.id. Without these indexes, the cleanup process was performing full table scans resulting in ~6 minutes per batch on tables with 300K+ rows.
Changes:
- Added migration 0099 to create indexes on task_instance.dag_version_id and dag_run.created_dag_version_id
- Updated ORM models (TaskInstance and DagRun) to include the new indexes in their __table_args__
- Updated the database revision head mapping to point to the new migration
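As a rough sketch of the upgrade/downgrade pairing the migration provides, the snippet below creates and drops the two indexes against an in-memory SQLite database. This is not the migration file from this PR: the table definitions are simplified assumptions, and the helper `index_names` is hypothetical. In an Alembic migration the corresponding calls would typically be `op.create_index` and `op.drop_index`.

```python
import sqlite3

# Index name -> (table, column). Names are taken from the PR description.
INDEXES = {
    "idx_task_instance_dag_version_id": ("task_instance", "dag_version_id"),
    "idx_dag_run_created_dag_version_id": ("dag_run", "created_dag_version_id"),
}


def upgrade(conn):
    """Create both indexes (stand-in for the migration's upgrade step)."""
    for name, (table, column) in INDEXES.items():
        conn.execute(f"CREATE INDEX {name} ON {table} ({column})")


def downgrade(conn):
    """Drop both indexes (stand-in for the migration's downgrade step)."""
    for name in INDEXES:
        conn.execute(f"DROP INDEX {name}")


def index_names(conn):
    """Hypothetical helper: list which of the two indexes currently exist."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
    return sorted(row[0] for row in rows if row[0] in INDEXES)


conn = sqlite3.connect(":memory:")
# Simplified stand-in tables (assumption: the real tables have many more columns).
conn.executescript(
    """
    CREATE TABLE task_instance (id INTEGER PRIMARY KEY, dag_version_id INTEGER);
    CREATE TABLE dag_run (id INTEGER PRIMARY KEY, created_dag_version_id INTEGER);
    """
)

upgrade(conn)
after_upgrade = index_names(conn)
downgrade(conn)
after_downgrade = index_names(conn)
print(after_upgrade, after_downgrade)
```

Keeping the index definitions in one mapping makes it harder for the downgrade step to drift out of sync with the upgrade step.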
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| airflow-core/src/airflow/utils/db.py | Updated the 3.2.0 revision head to point to the new migration |
| airflow-core/src/airflow/models/taskinstance.py | Added index on dag_version_id column to __table_args__ |
| airflow-core/src/airflow/models/dagrun.py | Added index on created_dag_version_id column to __table_args__ |
| airflow-core/src/airflow/migrations/versions/0099_3_2_0_add_dag_version_id_indexes_for_db_cleanup.py | New migration that creates the indexes with proper upgrade/downgrade logic |
| airflow-core/docs/migrations-ref.rst | Updated migration reference documentation to include the new migration as head |
…g_version_id_indexes_for_db_cleanup.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
What does this PR do?
Adds database indexes on task_instance.dag_version_id and dag_run.created_dag_version_id to fix the slow performance of airflow db clean -t dag_version command.
Why is this needed?
When running airflow db clean -t dag_version --batch-size 1000, the cleanup process was taking ~6 minutes per batch on tables with 300K+ rows. The root cause was missing indexes on the foreign key columns that reference dag_version.id.
The cleanup code in db_cleanup.py defines dag_version with dependent_tables=["task_instance", "dag_run"], meaning every delete operation needs to check both tables for FK violations. Without indexes, PostgreSQL performs full table scans.
What changed?
New migration (0098_3_2_0_add_dag_version_id_indexes_for_db_cleanup.py):
- idx_task_instance_dag_version_id on task_instance(dag_version_id)
- idx_dag_run_created_dag_version_id on dag_run(created_dag_version_id)
Updated ORM models: Added indexes to __table_args__ for schema consistency
Performance Impact
db clean -t dag_version (300K rows, batch=1000)
Fixes #60145
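The foreign key check mentioned under "Why is this needed?" can be sketched as follows. The example uses SQLite with foreign key enforcement turned on and simplified stand-in tables; the schema, the default FK action, and the helper `try_delete` are assumptions for illustration, not Airflow's actual definitions or cleanup code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE dag_version (id INTEGER PRIMARY KEY);
    -- Simplified stand-ins for the two dependent tables named in the PR.
    CREATE TABLE task_instance (
        id INTEGER PRIMARY KEY,
        dag_version_id INTEGER REFERENCES dag_version (id)
    );
    CREATE TABLE dag_run (
        id INTEGER PRIMARY KEY,
        created_dag_version_id INTEGER REFERENCES dag_version (id)
    );
    -- The indexes let the FK check find referencing rows without a full scan.
    CREATE INDEX idx_task_instance_dag_version_id ON task_instance (dag_version_id);
    CREATE INDEX idx_dag_run_created_dag_version_id ON dag_run (created_dag_version_id);

    INSERT INTO dag_version (id) VALUES (1), (2);
    INSERT INTO task_instance (dag_version_id) VALUES (1);
    """
)


def try_delete(connection, version_id):
    """Hypothetical helper: attempt to delete one dag_version row.

    Returns False when a dependent row still references it.
    """
    try:
        connection.execute("DELETE FROM dag_version WHERE id = ?", (version_id,))
        return True
    except sqlite3.IntegrityError:
        return False


deleted_referenced = try_delete(conn, 1)    # still referenced by task_instance
deleted_unreferenced = try_delete(conn, 2)  # no dependent rows
remaining = [row[0] for row in conn.execute("SELECT id FROM dag_version ORDER BY id")]
print(deleted_referenced, deleted_unreferenced, remaining)
```

Each delete makes the database look up referencing rows in both dependent tables, which is why the lookup cost on those columns dominates a batch.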